Project: Investigate a Dataset; using Gapminder to correlate democracy, equity, and education

Table of Contents

Introduction

Doomscrolling through my news feed, I see a pile-up of global and national crises, decades and centuries in the making. Perhaps the most pressing crises we face today are ecological. Everything we have stems from life on this planet. Even if we go to Mars, there are no economies, no politics, no humanity without a biosphere from which to derive them. Yet, it seems that humanity is all that stands between us and a mostly humane, non-catastrophic future, and that window of agency may be rapidly closing.

That's why, given this course's choice of datasets, I chose to investigate Gapminder's. Gapminder pulls data from various sources about the global state of affairs. It is a self-ascribed "fact tank, not a think tank" with facts about things like economics, education, population, and the environment. Specifically, I selected indicators to look into the relationships between democracy, education, and equity.

The main purpose of this project is to extend and showcase my exploratory data wrangling and Python skills, not to produce any rigorous research or analysis.

While I won't be demonstrating any causal correlation between these broad categories, my hunch going into the investigation was that there exists an interdependent relationship among all three axes (democracy, education, and equity): an equitable democracy depends on educated citizens; inequity inhibits proper education of the citizenry as a whole and tips the scale of the democratic process; democratic engagement in the public interest ensures equity writ large and equity in educational opportunity. In other words, there is no x factor. And, in such a systemic relationship, you may find that you can affect all parts by affecting one part, and you may also find that you can't significantly affect one part without affecting them all.

Testing these huge assertions is well beyond the scope of this project, but a reasonable starting point is to ask whether we can find any such correlation between specific measures of democracy, education, and equity. That is what I ask here.

What I found was that there were far more non-correlated indicator pairs across axes than correlated ones, despite overlap in the constructs and indicators (e.g. some measures of democracy include measures of equity, by definition). In an effort to limit this leakage between constructs, I restricted the equity axis to indicators of socioeconomic equity. The outcome was that democracy was the most self-correlated axis of indicators, while socioeconomic equity was the least self-correlated, indicating the relative stability and coherence of the constructs and the selection of indicators in these axes. The democracy and education axes were the most correlated with each other. No rigorous methodology was applied to the selection, grouping, and analysis of these indicators.

For a second research inquiry, I wanted to know which indicators do correlate strongly across axes. I selected three indicators that correlated strongly with multiple indicators across axes: measures of gender equity, government functioning, and poverty. I found that correlations existed among all three.

Data selection

Measures of democracy, education, and equity are many. They are also often overlapping, causing leakage. Some measures of democracy include measures of gender equity, for instance, and some measures of equity include measures of educational access.

I mitigated leakage primarily by narrowing equity to socioeconomic equity, using indicators such as distribution of wealth and income. I didn't include health and healthcare indicators here, though they are arguably within the domain of socioeconomic equity, in order to avoid overlap with democratic indicators and other possible constructs beyond the three axes I have chosen. I did, however, include some measures of unemployment and government insurance because they are more directly tied to income and socioeconomic security, but mostly out of curiosity.

Gender parity is a huge theme in Gapminder's datasets. I opted to exclude indicators measured in terms of one sex or the other, and included only measures of the population as a whole or of age groups. I did this to keep my focus on socioeconomic equity and to avoid overlap with democracy. I included at least one broad indicator of gender equity in general as an indicator of democracy.

A more rigorous selection and labeling process is certainly warranted by the research question, but not by the project scope and assignment rubric.

Holding the reins of my education, I chose a question big enough to hold my interest and to extend and strengthen my Python skills. I have knowingly set myself some statistical analysis tasks that might be better accomplished with other methods, and that might indeed be considered reinventing the wheel. But, in doing so, I have deepened my statistical understanding and improved my Python and coding skills in such a way that I am more ready to move on to more sophisticated methods and tools.

Data Wrangling

Extraction and Loading

I want to work with a lot of different tables from the Gapminder set, so rather than reading them into dataframes one by one, I downloaded them to a folder and wrote a function to read them into a dictionary of dataframes.
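A minimal sketch of such a loader (the function name `load_tables` and the folder layout are my own; I'm assuming each file is a Gapminder CSV with countries in the first column):

```python
from pathlib import Path

import pandas as pd

def load_tables(folder):
    """Read every CSV in `folder` into a dict of DataFrames,
    keyed by file name without the .csv suffix, with the
    first column (country) as the index."""
    tables = {}
    for path in Path(folder).glob("*.csv"):
        tables[path.stem] = pd.read_csv(path, index_col=0)
    return tables
```

The file stem doubles as the indicator name everywhere downstream, which is what makes the later lookup-by-name tricks work.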

There may have been an easier way to download the tables* in bulk, but I wanted to browse through them on the site and choose which to download anyway, so it was just as easy to download them one at a time. However, this choice leaves all metadata on the Gapminder site, so looking up descriptions of each indicator has to be done online. A quick internet search for the file name (without file type suffix) typically turns up the original data source. For this project, this more than suffices.

I also needed to label these indicator tables as belonging to one of my target axes: democracy, education, and socioeconomic equity. One approach might be to create an indicator class that has a dataframe and a label, but I want to put the indicators in a table for easier grouping by their label. So, I created a pandas Series of labels, indexed by indicator names (file name without the file type suffix). I created a dictionary of dataframes for the indicator tables, with keys identical to the label series index. When I want to select dataframes by their labels, I can select the keys from the label series index.
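In code, that label-series pattern looks like this (the tables here are toy stand-ins; only the indicator names are real Gapminder file stems):

```python
import pandas as pd

# Toy stand-ins for the real indicator tables (keys match labels.index).
tables = {
    "demox_eiu": pd.DataFrame({"2020": [9.26, 1.55]}, index=["Sweden", "Chad"]),
    "gvtx_eiu": pd.DataFrame({"2020": [9.64, 0.0]}, index=["Sweden", "Chad"]),
    "se_sec_enrr": pd.DataFrame({"2020": [115.0, 22.0]}, index=["Sweden", "Chad"]),
}

# One axis label per indicator, indexed by indicator name.
labels = pd.Series(
    {"demox_eiu": "democracy", "gvtx_eiu": "democracy", "se_sec_enrr": "education"}
)

# Select the dataframes belonging to one axis via the label index.
democracy = {k: tables[k] for k in labels.index[labels == "democracy"]}
```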

I originally wanted to put the indicator dataframes into a series as well and join it with the labels series into a dataframe. But, I ran into problems operating on dataframes nested within a dataframe (e.g. adjusting for population). Plus, in a notebook, dict keys autocomplete, whereas pandas index and column names don't. I might lose a marginal amount of processing speed with a dict, and perhaps some coding maneuverability, but it's a compromise I can live with.

* Note: I use "tables" generically to refer to 2D data structures with rows and columns. The table I'm talking about might be a CSV file, a pandas dataframe, a numpy array, etc., or even a pandas series.

General properties

Scrolling through the dictionary of tables, it appears that all tables have years for columns and countries for an index. They all appear to be two-dimensional and contain numerical values. They don't all share the same shape, some having more years and/or countries than others.

We do find that tables from the same organization tend to have the same shape. For instance, the Economist Intelligence Unit (EIU) democracy indicators* share the same countries and years, though some records may contain missing values.

* EIU calls them indices, but I don't want to confuse terms when speaking of table indices, so I call them indicators -- though the words share a root.

I created a better describe table -- actually two. I add pandas.DataFrame.skew and pandas.DataFrame.kurtosis to the pandas.DataFrame.describe output. To create a second table of summary stats of the summary stats (to compare columns to the table as a whole), I transpose the augmented describe table and describe it. It's not so necessary here, but I use it later in the analysis. I'll test drive it here to check out a couple of democracy indicators from EIU.
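A sketch of the two helpers (the names `describe_plus` and `super_describe` are mine, not from the notebook code):

```python
import pandas as pd

def describe_plus(df):
    """describe() augmented with skew and kurtosis rows."""
    desc = df.describe()
    desc.loc["skew"] = df.skew()
    desc.loc["kurtosis"] = df.kurtosis()
    return desc

def super_describe(df):
    """Summary stats of the summary stats: transpose the augmented
    describe table and describe it again, so each column's stats can
    be compared against the table as a whole."""
    return describe_plus(describe_plus(df).T)
```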

Labeling all the indicators was tedious, but there was no way around it. I have classified most of the indicators as indicators of socioeconomic equity.

Data cleaning

Most of the tables are scores or percentages (e.g. percentage of wealth shared by the poorest 20%), in the 0-1 and 0-100 ranges, which makes for easy correlation with each other. Other tables are gross populations (e.g. total number of children enrolled in primary school), which also makes for easy correlation with each other. But, correlation between the groups requires controlling for population. So, I need to determine which tables are gross populations and convert them to a percentage of population.

There is a population table spanning from the 1800s to the end of this century. When selecting the tables, I carelessly did not note which are gross populations, so I want a quicker way to find out which tables need adjustment without going back and manually looking at each one. To do this, I can find all the tables with max values above 100. There might be some gross tables that don't meet this criterion, so I can isolate the tables whose max values are below 100 but not equal to 1 and manually inspect them to decide which group they fit into.
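That triage can be sketched as follows (the three tables are toy stand-ins, and `global_max` and the list names are illustrative, echoing the `max_over_100_lst` name used later):

```python
import pandas as pd

# Toy stand-ins: a 0-1 score, a percentage, and a gross population table.
tables = {
    "gendereq_idea": pd.DataFrame({"2020": [0.9, 0.2]}),
    "poorest_20pct_share": pd.DataFrame({"2020": [9.1, 4.2]}),
    "number_of_people_in_poverty": pd.DataFrame({"2020": [0.4, 112.0]}),
}

def global_max(df):
    """Largest value anywhere in the table (NaNs ignored)."""
    return df.max().max()

# Max over 100 -> presumably a gross population table.
max_over_100_lst = [k for k, df in tables.items() if global_max(df) > 100]

# Max at or below 100 but not exactly 1 -> ambiguous, inspect manually.
ambiguous_lst = [
    k for k, df in tables.items()
    if global_max(df) <= 100 and global_max(df) != 1
]
```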

Of these tables, some values are actual population numbers, and others are stated in millions of people. Since I'm only going to be looking at correlations, this might not pose an issue. But, I am concerned that the extremely small values that result from adjusting for population might push the limits of float-type accuracy. So, I will need to identify which gross population tables are not actual population numbers and adjust accordingly.

With that adjustment, all tables should have values relative to their country and year, and thus we can draw correlations.

There are many NaNs and some zeros in a lot of the tables. I chose to leave zeros as zeros when reading the files into dataframes because many of the tables contain both NaNs and zeros. This, coupled with the fact that these tables come from Gapminder where they have already been cleaned up, suggests that each NaN and zero is intentional. I may need to handle missing values later in the analysis.

That said, some tables have no NaNs but do have zeros, some with many. That raises concerns, as the many different sources may have used different methods for handling missing values, and Gapminder may not have applied a standard. Some tables may represent missing values with zeros, and others may use NaNs. Or, there may simply be tables with no missing values. I have not found any guidance on this on the Gapminder site. If I needed to be certain, I could contact someone at Gapminder or look at how others have handled this. For now, I will just keep an eye out for anything fishy, like indicators that don't correlate with any other indicators at all.

I'll start by isolating gross population tables to adjust them to proportions of population.

Looking up a few of these online, I can see right off the bat that many of the ambiguous tables contain percentages. I can remove the names that contain 'per'. I can also see that some of these are range-bound scores. I can remove the names that contain 'index', 'score', and 'gini'.

Okay, this looks like it might be a lot of busy work, but there are really only a few groups to manually check. Looking up these tables online, I see a few other substrings indicating percentages/scores, and some indicating duration in years and ages.

Okay, that should be all the tables whose values are gross or net quantities rather than percentages, scores, or other non-population values. But, I happened to notice that se_sec_enrr, which I explicitly removed from the ambiguous list, was added to max_over_100_lst. This table is an enrollment ratio table and shouldn't be adjusted for population. Plus, I recognize a removal substring in some of the names. I'm going to go ahead and manually look up the max_over_100_lst tables. There aren't many left.

Now, I need to double check online to see if the values in each table are actual values, or values in millions -- or something else.

And, it looks like only one needs to be multiplied, number_of_people_in_poverty.

Looks like it worked. Now, I need to adjust for population.

I'll change the keys of the tables I adjust in order to keep track of it. That means I'll need to change the index of indicator axis labels accordingly.
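Concretely, the millions conversion, the population adjustment, and the re-keying can be sketched together like this (the numbers are toy values, and the `_pct` suffix is my own convention for the new key):

```python
import pandas as pd

population = pd.DataFrame(
    {"2020": [10_353_000, 16_425_000]}, index=["Sweden", "Chad"]
)

tables = {
    # Values stated in millions of people (per the source docs).
    "number_of_people_in_poverty": pd.DataFrame(
        {"2020": [0.4, 7.0]}, index=["Sweden", "Chad"]
    ),
}
labels = pd.Series({"number_of_people_in_poverty": "equity"})

# 1. Convert millions to actual counts.
gross = tables.pop("number_of_people_in_poverty") * 1_000_000

# 2. Divide by population; pandas aligns index and columns automatically.
tables["number_of_people_in_poverty_pct"] = gross / population * 100

# 3. Keep the label series index in sync with the new key.
labels = labels.rename(
    {"number_of_people_in_poverty": "number_of_people_in_poverty_pct"}
)
```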

Because the population table had a different shape than the people in poverty table, dividing by the population table reshaped the poverty table, adding a lot of years and countries, which brought in a lot of NaNs.

But, the operation appears to have worked on a sample column, and our sample table has the same number of observations as before.

Let's trim those extra rows and columns with all NaNs. In fact, let's do this for all the tables. This might save compute later.
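Trimming all-NaN rows and columns is a short loop with `dropna(how="all")`; a sketch with a toy table:

```python
import numpy as np
import pandas as pd

tables = {
    "toy": pd.DataFrame(
        {"1999": [np.nan, np.nan], "2000": [1.0, np.nan]},
        index=["Sweden", "Atlantis"],
    )
}

# Drop rows, then columns, that are entirely NaN, for every table.
for name, df in tables.items():
    tables[name] = df.dropna(how="all").dropna(axis=1, how="all")
```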

Okay, let's dig in!

Exploratory Data Analysis

Package preview: cufflinks, iplot()

Cufflinks is a wrapper around Plotly's plot function that lets you call iplot() directly on pandas DataFrames and Series.

Examples: https://github.com/santosjorge/cufflinks/blob/master/Cufflinks%20Tutorial%20-%20Pandas%20Like.ipynb

Documentation: https://github.com/santosjorge/cufflinks https://github.com/santosjorge/cufflinks/blob/master/Cufflinks%20Tutorial%20-%20Plotly.ipynb

I want to test it out on the EIU democracy indicator. The iplot function alone has a long list of optional parameters, but playing with the kind parameter, I found a neat interactive 3D surface map that gives you a sense of the variability of the data overall as well as a way to peek at each country over the years. Hover over the upper-right corner to change the way you interact with the map.

Something that bothers me about iplot (other than the fact that its axis-title parameters don't work) is that I didn't find a way to easily plot along the columns (e.g. axis=1). I used the surface map above to get around this, but I'd like to see a table as line charts for each row, that is, for each country on this table. I have to transpose the dataframe and then plot it.

It gave me a very busy line chart since there are 164 countries, but something jumps out at me: countries with lower democracy scores tend to have more unstable scores over the years than those with higher scores. Another way to say that is that countries with stronger democracy scores tend to stay that way.

There's not enough information here to determine why that is the case. We can't say whether democratic conditions create the conditions for democratic conditions to continue, or whether there's some other underlying factor. It simply appears to be the case that strong democracies tend to stay strong democracies, within the timespan we're observing.

We can see the same feature a little more dramatically by comparing scatter plots of the years compared to other years.

It would be tedious to chart all 196 comparisons of the 14 years, so let's look at the earliest year compared to the latest year and some years in between.

You can watch the correlation weaken as time passes. In other words, without drawing any true predictive conclusions, it appears that the strength of a country's democracy in one year might predict the strength of its democracy in following years, but less so the further forward you look.
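Computing that decay is one line with `DataFrame.corrwith`. The data below are synthetic (a persistent base score per country plus noise that grows each year), purely to illustrate the computation; the real input would be the EIU table:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the EIU table: a persistent base score per
# country plus noise that grows each year, so correlation with the
# first year decays over time.
rng = np.random.default_rng(2)
base = rng.uniform(1, 10, size=100)
scores = pd.DataFrame(
    {year: base + rng.normal(scale=0.3 * i, size=100)
     for i, year in enumerate(range(2006, 2021))}
)

# Correlation of the earliest year with every year, in order.
decay = scores.corrwith(scores[2006])
```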

Notice the way the later scatter plots seem to loosen up more around their bottoms than their tops. Again, without drawing any conclusions about causation, this suggests that countries with weaker democracies may be more likely to improve or degrade their democratic features. Or, full-fledged democracies may be more likely to stay that way.

Also, notice how iplot draws scatter plots with lines instead of dots. I used Plotly Express instead.

This might be worth revisiting in a more rigorous analysis beyond the scope of this exploratory project. You might start by dividing the countries into those with a mean score above the table mean score and those with a mean score below, then compare the variance of the two groups, and do the same for other measures of democracy. Again, this is beyond the scope of this project.

Let's move on to the main research question regarding the larger set of tables.

Are democracy, education, and socioeconomic equity correlated?

To get at this question, I need to be able to compare entire tables to each other, to correlate point by point at a scalar level. I found no direct way to correlate an entire 2d dataframe with another, only column to column or row to row.

So, to compare two indicators (i.e. two dataframes loaded from the CSV files), I need to convert each indicator into a single column or series. Furthermore, each column needs to have the same shape in order to satisfy correlation function requirements. They need to share an index with the rest, or at least order the values the same way, so that the same (country, year) fields are compared. Also, it would be best if they were all in the same table to make for easier operations.

Below, I do that. The resulting dataframe is such that each column is the values of a single indicator, and the index is a Pandas MultiIndex of years and countries.
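The core of that reshaping is `stack()`, which turns each (country x year) table into a Series indexed by (country, year); a dict of such series then lines up as columns of one wide dataframe. A sketch with toy tables:

```python
import pandas as pd

tables = {
    "demox_eiu": pd.DataFrame(
        {"2019": [9.39, 1.61], "2020": [9.26, 1.55]}, index=["Sweden", "Chad"]
    ),
    "gvtx_eiu": pd.DataFrame(
        {"2019": [9.64, 0.0], "2020": [9.64, 0.0]}, index=["Sweden", "Chad"]
    ),
}

# Stack each table into a Series indexed by (country, year), then
# line the series up as columns of one wide dataframe.
long = pd.DataFrame({name: df.stack() for name, df in tables.items()})
long.index.names = ["country", "year"]
```

Pandas aligns the stacked series on the union of their (country, year) indices, so tables of different shapes simply contribute NaNs where they have no data.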

More cleaning

It looks like there's an indicator with only 2 values (see the super describe table [min, count]). I want to remove that indicator, and possibly others with few values. I'll set a relatively low minimum value count of 30 and remove all tables with fewer observations than that. I might find later that I need to raise the bar if I run into problems.

And, I'll strip any rows that ended up with all NaNs after removing columns. I'll also need to remove from the labels series any indicators I removed from the dataframe.

I also see that my maximum value is 2,490. I wasn't anticipating values this high, but it's not necessarily a bug. Although, I do want to look more closely at indicators with high max values, say over 300.

We've got a couple of tables of test scores that I recall from looking at them are in the hundreds, so those are fine.

The other table is the table with the highest value, and it's government expenditure per tertiary student, expressed as a percentage of GDP per capita.

2,490% seems like a lot. You can see below that it's many IQRs above the third quartile. I should remove outliers from this and other indicators.

First, I want to see which country this is!

Yes, let's remove outliers from this indicator and any others that need it. I will arbitrarily remove values with z-scores greater than 3. For another project, I might need to develop a more rigorous algorithm for setting the z-score limit for each indicator. For instance, I might use the IQR. But, I use quantile() elsewhere in this analysis, so I want to try out zscore() here.
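The text uses zscore(); here is an equivalent in plain pandas (the name `mask_outliers` is mine, and note that pandas' `std` uses the sample formula, whereas scipy.stats.zscore defaults to the population formula):

```python
import pandas as pd

def mask_outliers(col, z_limit=3.0):
    """NaN-out values whose z-score magnitude exceeds z_limit."""
    z = (col - col.mean()) / col.std()
    return col.where(z.abs() <= z_limit)

# Example: a value in the thousands among values in the twenties.
col = pd.Series([20.0] * 7 + [25.0] * 7 + [2490.0])
cleaned = mask_outliers(col)
```

One caveat worth knowing: with very few observations, the maximum possible z-score is bounded by (n - 1) / sqrt(n), so a z > 3 rule can never fire on tiny columns no matter how extreme the value.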

Seems to have done the trick.

Back to exploration!

I'd love to throw this table of indicators at pandas_profiling, but it proved too much for my 32GB of RAM. We'll hold off for a smaller selection of indicators.

We're ready to find the correlations between all these indicators and get a bird's eye view of the relationships with a heatmap of the r values. I'll arbitrarily set the minimum number of observations needed for correlation to 20.
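pandas supports that observation floor directly via the min_periods parameter of DataFrame.corr; a sketch with synthetic columns (the names a, b, c are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
long = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
long.loc[5:, "c"] = np.nan  # leave c with only 5 observations

# Pairs with fewer than 20 overlapping observations get NaN.
r = long.corr(min_periods=20)
abs_r = r.abs()  # direction dropped for the later analysis
```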

Here's another busy plot. There are some big-picture features to notice that might direct our inquiry further, though.

There's the expected diagonal where r=1 because it's self-correlation. And, there are a handful of NaN points where the correlated indicators did not have enough intersecting data points to produce an r value.

The heatmap is also noticeably criss-crossed with horizontal/vertical lines (representing the correlations of a single indicator) of low-to-moderate correlation. The indicators least correlated to most of the rest seem to be grouped together with others from the same sources representing similar domains, such as indicators from the World Bank about the duration and ages of educational levels (e.g. se_prm_durs, primary education duration in years).

Pockets of high correlation (positive or negative) jump out, too. These are typically similar or logically related indicators. For instance, multiple orgs (IDEA, EIU, etc.) have produced baskets of indicators of different aspects of democracy (e.g. political participation, free and open elections, etc.). These are all highly correlated, even across sources. This suggests that democracy is in fact a coherent, cohesive construct. Indicators of distribution of wealth are also highly correlated to each other, and must be, by definition. Though these high correlations are mostly unsurprising, looking closer at the nature of these relationships (e.g. linear or not?) may prove interesting for another study.

Looking at this heatmap, I don't see a lot of evidence for my initial belief that democracy, socioeconomic equity, and education are interdependently correlated. In fact, I see a lot of evidence to the contrary. Multiple educational indicators are not correlated to much of anything else, while multiple measures of income distribution are not correlated to democratic indicators. Seeing this, I definitely wouldn't make blanket statements about the connections between such broad constructs like democracy, equity, and education.

I do want to look more closely, as a point of open interest (not to seek confirmation of my initial bias). Let's look at these axes of indicators (i.e. democracy, socioeconomic equity, and education). Since I'm more interested in the existence of correlation than its direction, I'll take the absolute value of the r-scores to make them easier to work with. If I want to know the direction of a particular correlation, I can still look it up.

Absolute values of r-scores

We can see that most of the lower two quartiles of the absolute r-scores are low enough to consider their indicators uncorrelated (i.e. absr_super_desc[75%, 50%] = 0.36). That is, most indicators don't correlate well with most other indicators. And, only a portion of the upper quartiles of r-scores are high enough to consider their indicators strongly correlated (see absr_super_desc[max, 75%] = 0.74).

Before we compare indicators across axes, let's have a quick peek at the heatmap of the absolute r-values.

The pockets of correlation and axes of noncorrelation are clearer now.

Let's group indicators.

Axes of indicators

The tables above are summary stats of the summary stats of each "view" of the table of absolute r-scores. The views are selected columns (based on the labels given to them) and all the rows. So, the views show each axis of indicators' correlations to the rest of the table.

Just as in the table as a whole, each set of indicators has only a few members for which even a portion of the upper quartile of correlation values shows a moderately significant correlation to another indicator (see [max, 75%] == 0.7 or so). Also like the whole set of indicators, most indicators in each set have roughly half of their r-scores low enough to say there's no correlation. Looking at the skew and kurtosis columns (and rows), we see more variation between sets, meaning some axes of indicators may be more correlated to other axes than others. But, we have to keep in mind that these tables include self-correlations, and set sizes differ.

Let's separately correlate the axes with each other, and with themselves. We'll take a look at the axis correlations as a whole, then look at especially high and low correlations.
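Slicing those axis-versus-axis blocks out of the absolute-r matrix is just label-based indexing; a sketch (the matrix values here are random placeholders, and `axis_block` is my name for the helper):

```python
import numpy as np
import pandas as pd

# Toy abs-r matrix over four indicators, plus their axis labels.
inds = ["demox_eiu", "gvtx_eiu", "gini", "se_sec_enrr"]
labels = pd.Series(
    ["democracy", "democracy", "equity", "education"], index=inds
)
rng = np.random.default_rng(1)
abs_r = pd.DataFrame(rng.random((4, 4)), index=inds, columns=inds)

def axis_block(abs_r, labels, axis_a, axis_b):
    """Sub-block of the correlation matrix: rows from one axis,
    columns from another (or the same axis, for self-correlation)."""
    rows = labels.index[labels == axis_a]
    cols = labels.index[labels == axis_b]
    return abs_r.loc[rows, cols]

dem_vs_edu = axis_block(abs_r, labels, "democracy", "education")
```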

I'll use surface maps rather than heatmaps to plot the tables. They allow the relationships to be quickly grokked by those who are color oriented as well as those who are spatially oriented. The maps are interactive, so you can rotate them to view from directly above or below and simulate a 2D heatmap. Unfortunately, though iplot takes axis-label parameters, it is not including them in the output as of this execution.

Before we pull out some exemplar and exceptional indicator pairs and wrap up, there are some things to note about the surface maps above.

The democracy axis looks highly self-correlated overall. The construct seems fairly coherent and cohesive, even across sources. It's worth noting that a few of these indicators are actually composites of some of the others (e.g. demox_eiu). There are a couple of exceptional indicators that form a tic-tac-toe grid of low correlation.

The socioeconomic equity axis is largely not self-correlated, except along the diagonal of direct self-correlation and among similar indicators from the same source (i.e. similarly prefixed names, thus alphabetically proximal). This low degree of self-correlation makes it hard to view the axis as a stable construct. There are some patches of high correlation among definitionally related indicators (e.g. between indicators of income distribution, or between indicators of unemployment and unemployment insurance). I would say that my loose and casual method of selecting these indicators and defining the axis is the major driver of such a poor construct. This axis probably needs to be deconstructed into a set of axes in its own right, which I'm not going to do.

The education axis is more self-correlated, but not as much as the democracy axis. It's criss-crossed with a number of low-correlating indicators. If I wanted to (de/re)construct this axis, I might start by pulling a list of these low-correlating indicators and looking more closely.

The democracy and socioeconomic equity axes are mostly not correlated, but there are some striking peaks and bands of high correlation, for instance between an indicator of government efficacy and an indicator of poverty. In fact, measures of poverty and mean consumption seem to be fairly correlated to measures of democracy as a whole.

The democracy and education axes have some notable peaks and indicators. For instance: some measures of educational attainment and enrollment are moderately correlated with the democracy axis, no doubt partially due to gender equity being included as a measure of democracy, leading to more students overall; some measures of educational expenditure do not correlate well with the democracy axis. The basic welfare indicator from IDEA is fairly correlated to education indicators overall, which is not surprising since it includes a few measures of education.

The education and socioeconomic equity axes are similarly related. For instance, educational attainment and enrollment are correlated to overall income per capita and poverty indicators.

Research Question 2: what are some exemplar and exceptional indicators and correlations across axes (or lack thereof)? Gender equity, government efficacy, and poverty

To wrap up, let's pull out some notable indicators and take a closer look. I'll grab the indicators with the highest and lowest correlations across axes. Then, I'll take a look at a few indicators that caught my interest in the surface maps above.

It looks like gender equity (gendereq_idea), government functioning (gvtx_eiu), and poverty (si_pov_mdim_xq) are all correlated.

First let's get that list of indicators.

Again, we have a lot more extreme absence of correlation across axes than we have high correlation. Every indicator is uncorrelated with at least one other indicator from a different axis, but only a handful are highly correlated with another indicator from a different axis.

Let's look at our trio of indicators: gender equity (gendereq_idea), government functioning (gvtx_eiu), and poverty (si_pov_mdim_xq).

There is some leakage between the poverty indicator and the gender equality indicator; access to schooling is included in both, and when half the population has limited or no access based on their gender, both indicators will be affected. But, educational access is only a single point in each indicator's constellation of sub-indicators. I'm not sure the educational access link is enough to explain a correlation as strong as we see below. There may be other underlying factors, and/or there may be a logical link between empowering all of society rather than just half of it and the economic productivity and power of its members. I am not going to draw any statistical conclusions, though.

There is also some leakage between the gender equity indicator and the government functioning indicator. One of the 14 points of the government functioning indicator has to do with special groups, like religious groups that may be patriarchal, having "significant political power, parallel to democratic institutions." Again, I am not sure that point of overlap is enough to explain their correlation. In fact, three of the five points of the gender equality indicator have to do with the inclusion of women in government and civil society organizations. In other words, the more women we see in government, the better it tends to run, and vice versa. Again, I can't draw a statistical conclusion about causation.

Interestingly, I see no direct leakage between definitions of the government functioning indicator and the poverty indicator. A major underlying factor may be the resources and global economic context of the country in question. It takes national wealth to run a country well.

Conclusions

I grabbed baskets of indicators relating to democracy, education, and socioeconomic equity. I found that there was mostly a lack of correlation between these three axes, contrary to my expectation. There were some notable correlations, however, including between gender equity, government efficacy, and poverty.

There were profound problems with my casual data selection and grouping method (if you could call it a method). Not the least of these problems was the fact that many of the indicators overlapped by definition with indicators in a separate axis (i.e. democracy, education, equity). Democracy seems to have become a far-reaching construct incorporating so many aspects of a healthy society as to reveal the paradigmatic bias of the society that constructed it. Stable though it is in its self-correlation, the definition of democracy might need to be narrowed to methods and structures of political power and government function in order to draw more meaningful correlations between it and other constructs like education and socioeconomic equity that currently overlap with democracy, causing leakage in the correlations. My casual choice of indicators of socioeconomic equity was apparent in the way the axis was largely not self-correlated.

I made heavy use of pandas data structures and operations. Over the course of this project I 'leveled up' my Python coding, especially with regards to functional programming.

I also test drove cufflinks and pandas_profiling. I found them both to be buggy. When pandas_profiling didn't crash my browser, it often omitted many values and graphics.

Resources used

I made heavy use of pandas, numpy, and Python documentation, as well as Stack Overflow, Geeks for Geeks, Real Python, and other instructional blogs, too many to count and too inconsequential to include. I did not copy any code nor make use of any unique or proprietary ideas. I only gleaned how to code better and solve my problems. To cite them would be like citing books about the English language that I have read simply because I am writing in English and using words I first saw in them.